Abstract

The diversity of populations in domestic species offers great opportunities to study genome
response to selection. The recently published Sheep HapMap dataset is a great example of
characterization of the worldwide genetic diversity in sheep. In this study, we re-analyzed the
Sheep HapMap dataset to identify selection signatures in worldwide sheep populations. Compared to
previous analyses, we made use of statistical methods that (i) take account of the hierarchical
structure of sheep populations, (ii) make use of linkage disequilibrium information and (iii) focus
specifically on either recent or older selection signatures. We show that this allows pinpointing
several new selection signatures in the sheep genome and distinguishing those related to modern
breeding objectives from those related to earlier post-domestication constraints. The newly
identified regions, together with the ones previously identified, reveal the extensive genome
response to selection on morphology, color and adaptation to new environments.

Introduction 

Domestication of animals and plants has played a major role in human history. With the advance of 
high-throughput genotyping and sequencing technologies, the analysis of large datasets in domesticated 
species offers great opportunities to study genome evolution in response to phenotypic selection [1]. The 
sheep was one of the first grazing animals to be domesticated [2] in part due to its manageable size and an 
ability to adapt to different climates and diets with poor nutrition. A large variety of breeds with distinct 
morphology, coat color or specialized production (meat, milk or wool) were subsequently shaped by 
artificial selection. Since the release of the 50K SNP array [3], it has been possible to scan genetic diversity in
sheep in order to detect loci that have been involved in these various adaptive selection events. The Sheep 
HapMap dataset, which includes 50K genotypes for 3000 animals from 74 breeds with diverse world-wide 
origins, provides a considerable resource for deciphering the genetic bases of phenotype diversification in 
sheep. In the first analysis of this dataset [4], the authors looked for selection by computing a global 
Fst among the 74 breeds at all SNP in the genome. They identified 31 genome regions with extreme 
differentiation between breeds, which included candidate genes related to coat pigmentation, skeletal 
morphology, body size, growth, and reproduction. Further studies took advantage of the Sheep HapMap 
resource to detect genetic variants associated with pigmentation [5], fat deposition [6], or microphthalmia
disease [7]. Another study [8] performed a genome scan for selection focused on American synthetic
breeds, using an Fst approach similar to that in [4].

The 74 breeds of the Sheep HapMap dataset have a strong hierarchical structure, with at least 3 
distinct differentiation levels: an inter-continental level (e.g. European breeds vs Asian breeds), an intra- 
continental level (e.g. Texel vs Suffolk European breeds), and an intra-breed level (e.g. German Texel 
vs Scottish Texel flocks). Recent studies [9-12] showed that, when applied to hierarchically structured 
data sets, Fst based genome scans for selection may lead to a large proportion of false positives (neutral
loci wrongly detected as under selection) and false negatives (undetected loci under selection). Besides,
the heterogeneity of effective population size among breeds implies that some breeds are more prone to
contribute large locus-specific Fst values than others [10]. Apart from these statistical considerations,
merging populations with various degrees of shared ancestry can limit our understanding of the selective
process at detected loci. Indeed, the regions pointed out in [4] can be related to either ancient selection,
such as the poll locus which has likely been selected for thousands of years, or fairly recent selection, such as
the myostatin locus which has been specifically selected in the Texel breed. But in most situations the time
scale of adaptation cannot be easily determined.

Another limit of genome scans for selection based on single SNP Fst computations is that they do
not sufficiently account for the very rich linkage disequilibrium information, even when the single SNP
statistics are combined into windowed statistics. Recently, we proposed a new strategy to evaluate the
haplotype differentiation between populations [13]. We showed that using this approach greatly increases
the detection power of selective sweeps from SNP chip data, and also enables the detection of soft or
incomplete sweeps. These latter selection scenarios are particularly relevant in breeding populations, where
selection objectives have likely varied over time and where the traits under selection are often polygenic.

In this study we provide a new genome scan for selection based on the Sheep HapMap dataset, where 
we distinguish selective sweeps within and between 7 broad geographical groups. The within group 
analysis aims at detecting recent selection events related to the diversification of modern breeds. It is 
based on the single marker FLK test [10] and on its haplotypic extension hapFLK [13]. The FLK test is 
an extension of the Lewontin and Krakauer (LK) test [14] that accounts for population size heterogeneity 
and for the hierarchical structure between populations. Like the LK test, the FLK test computes a global
Fst for each SNP, but allele frequencies are first rescaled using a population kinship matrix F. This
matrix, which is estimated from the observed genome wide data, measures the amount of genetic drift that
can be expected, under neutral evolution, along all branches of the population tree. With this rescaling,
allele frequency differences are typically down-weighted if they are obtained with small populations, or
populations that diverged a long time ago. The between group analysis focuses on older selection events 
and is only based on FLK. Overall, we confirmed 19 of the 31 sweeps discovered in [4], while providing 
more details about the past selection process at these loci. We also identified 71 new selection signatures, 
with candidate genes related to coloration, morphology or production traits. 
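
To make the rescaling by the kinship matrix concrete, the following is a minimal Python sketch of an FLK-type statistic for a single SNP. It assumes the quadratic form described in [10] as we understand it: the ancestral frequency p0 is estimated as a weighted mean of population frequencies with weights taken from the inverse of F, and deviations from p0 are scaled by p0(1 - p0) F. The function names (`solve`, `flk_statistic`) and the toy matrices are hypothetical illustrations, not the implementation used in this study, and the estimation of F from genome-wide data is not shown.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            raise ValueError("kinship matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def flk_statistic(p, F):
    """Return (p0, FLK) for population allele frequencies p and kinship matrix F.

    Assumed form: p0 = (1' F^-1 p) / (1' F^-1 1) and
    FLK = (p - p0)' F^-1 (p - p0) / (p0 * (1 - p0)).
    """
    n = len(p)
    w = solve(F, [1.0] * n)  # F^-1 1, the weights of each population
    p0 = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
    if not 0.0 < p0 < 1.0:
        raise ValueError("estimated ancestral frequency must lie strictly in (0, 1)")
    d = [pi - p0 for pi in p]
    quad = sum(di * xi for di, xi in zip(d, solve(F, d)))
    return p0, quad / (p0 * (1.0 - p0))


# Toy example (hypothetical values): three populations, the third one
# carrying a divergent allele frequency.
equal_drift = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
one_drifted = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.5]]
freqs = [0.4, 0.4, 0.8]

p0_equal, flk_equal = flk_statistic(freqs, equal_drift)
p0_drift, flk_drift = flk_statistic(freqs, one_drifted)
print(flk_equal, flk_drift)
```

In this toy example the same frequency difference yields a smaller statistic when the divergent population sits on a long branch (large diagonal entry of F), which is the down-weighting behavior described above.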

Results and discussion 

We detected selection signatures using methods that aim at identifying regions of outstanding genetic 
differentiation between populations, based either on single SNP, FLK [10], or haplotype, hapFLK [13], 
information. These methods have optimal power when working on closely related populations so we 
separately analyzed seven groups of breeds, previously identified as sharing recent common ancestry 
[4] and corresponding to geographical origins of breeds. Before performing genome scans for selection 
signatures, we studied the population structure of each group to identify outlier animals as well as 
admixed and strongly bottlenecked populations, using both PCA and model-based approaches [15, 16]. 
hapFLK was found to be robust to bottlenecks or moderate levels of admixture, but these phenomena 
may affect the detection power so we preferred to minimize their influence by removing suspect animals 
or populations. Details of these corrections are provided in the methods section. The final composition
of population groups is given in Table 1.

Overview of selected regions 

An overview of selection signatures on the genome across the different groups is plotted in Figure 1 and 
a detailed description is provided in Table 2. Detected regions were typically a few megabases long and 
included from 1 to 196 genes, with a median of 15 genes. However, in many regions strong functional 
candidate genes were found very close to the position with the lowest p-value, typically among the two
closest genes to this position. These genes are reported in Table 2, as well as a few other functional
candidates with less statistical evidence but strong prior knowledge from the literature. We found 41 
selection signatures with hapFLK and 26 with FLK, although we allowed a slightly higher false discovery 
rate for FLK than hapFLK (10% vs 5%). This result was consistent with a higher power for hapFLK 
than FLK, as already shown in [13]. 
Four regions were found with both the single SNP and the haplotype test and harbor strong candidate genes: NPR2, KIT, RXFP2 and EDN3 (Table 2). The overlap was thus small, illustrating that the two tests tend to capture different signals. In particular, hapFLK will fail to detect ancient selective sweeps, for which the mutation-carrying haplotype is small and not associated with many SNP on the chip. In contrast, single SNP tests will fail to capture selective sweeps when no single SNP is in high LD with the causal mutation. Unlike hapFLK, they will also fail if the selected mutation is only at intermediate frequency but is associated with a long haplotype.

Six regions were detected in more than one group of breeds. They all contained strong candidate genes (Table 2). Three of these genes are related to coat color (KIT, KITLG and MC1R), and could correspond to independent selection events (see discussion below). One region harbors a gene (RXFP2) for which polymorphisms have been shown to affect horn size and polledness in the Soay [17] and Australian Merino [18]. We detected this region in 4 different groups and in all of them the highest FLK value was found to be very close to RXFP2 (Figure S8). This provides clear evidence that selection in this region is related to RXFP2, consistent with previous selection signatures detected by comparing specifically horned and polled breeds (Figure 6 in [4]). However, we note that the signatures of selection in this region exhibit different patterns among groups. The signal is very narrow in the SWE and SWA groups, and is in fact not detected by the hapFLK test, whereas it affects a large genome region in the CEU group where it is detected by hapFLK. In the ITA group, the FLK statistics do not reach significance, and the hapFLK signal is not high (minimum q-value of 0.04). Overall, the selection signatures suggest that selection on RXFP2, most likely due to selection on horn phenotypes, was carried out worldwide at different times and intensities. Another region harbors the HMGA2 gene, involved in selection for stature in dogs [19]. The last region includes two interesting candidate genes: ABCG2, which has been associated with a strong QTL for milk production in cattle [20], and NCAPG, which has been associated with fetal growth [21] and calving ease [22] in cattle and which is located in several selection signatures in this species [23-26]. In our analysis, populations with a selection signature in this region belong to three European groups (SWE, ITA and CEU) and our results suggest that selection in these different groups might involve distinct genes (Table 2).

In the paper presenting the Sheep HapMap dataset [4], 31 selection signatures were found, corresponding to the 0.1% highest single SNP Fst. Using FLK and hapFLK, we confirmed signatures of selection for 10 of these regions. Considering that the two analyses were performed on the same dataset, this overlap can be considered rather small. Two reasons can explain this.

Downloaded from http://biorxiv.org/ on September 18, 2014

First, the previous analysis was based on the Fst statistic. Although this statistic is commonly used for selection scans, it is prone to produce false positives when the population tree harbors unequal branch lengths (i.e. unequal effective population sizes) [10]. In particular, strongly bottlenecked breeds will preferentially contribute high Fst values even under neutral evolution, because their smaller effective population size implies a larger variance of allele frequencies. With FLK and hapFLK, Fst values between populations are rescaled using branch lengths, so populations with long branch lengths will not contribute more than others [13]. In fact they will tend to contribute less, as the statistical power to distinguish selective effects from drift effects is naturally lower in populations where drift is larger.

Second, the previous analysis was performed using all breeds at the same time. It is therefore possible that some of these regions correspond to differentiation between groups of breeds rather than within groups. To investigate this question, we performed a genome scan for selection between the ancestors of the seven population groups using the FLK statistic computed on their estimated allele frequencies [10]. We did not include SNP lying in regions detected within groups since selection biases their estimated ancestral allele frequencies. The population tree was reconstructed using SNP for which we have unambiguous ancestral allele information (Figure S9). The tree is decomposed into two main lineages, one for European breeds and one for Asian and African breeds. The African group exhibits a slightly longer branch. We note, however, that this could be due to ascertainment bias of SNP on the SNP array. This genome scan led to the identification of 23 new selection signatures (Figure 2 and Table 3), 9 of them being common to the analysis of [4]. Overall, combining the scans for recent and ancestral selection, we failed to replicate 12 of the regions in [4].

Selection signatures within population groups

Coloration. Many selection signatures are located around genes that have been shown to be involved in hair, eye or skin color. In particular, several detected regions include candidate genes that are involved in the development and migration of melanocytes and in pigmentation: EDN3, KIT, KITLG, MC1R and MITF. For all these genes except MITF, we have quite strong evidence that they are the genes targeted by selection in the detected region. In the SWA group, EDN3 was included in the detected region for both FLK and hapFLK, and in both cases it was the closest gene to the highest test value. KIT and KITLG were both included in a detected region (with relatively few genes) for two different geographical groups, and were very close to the position with the smallest p-value in one of those. MC1R was also in a detected region for two different groups, NEU and ITA. In both cases it was not very close to the maximum of the signal, but we note that black skin or coat color is an important characteristic of the two populations that have been found under selection in this region, the Irish Suffolk and Sardinian Ancestral Black. This observation, together with the fact that MC1R mutations are responsible for coat color patterns in mammals (e.g. in cattle [27]), supports the hypothesis that MC1R is a good candidate for the signatures we observed.

Although not listed in Table 2, SOX10 and ASIP, two other genes involved in pigmentation, also show some evidence of selection. In the ITA group, the q-value of hapFLK near SOX10 is 6.2% and almost reaches the significance threshold of 5%. Similarly, the two closest SNP to ASIP (s66432 and s12884) present suggestive FLK p-values of respectively 7.5 × 10^-4 and 6.8 in the ASI group, and one (s12884) is significantly differentiated between the ancestral groups. All these genes have previously been reported as being likely selection targets and/or associated with color patterns in different mammalian species. Finally, we found a signal for selection centered on the BNC2 gene, which has recently been associated with skin pigmentation in humans [28]. All population groups present at least one selection signature which is very likely related to one of the above genes, reflecting the widespread importance of color patterns in defining sheep breeds.

Inferring a precise history of underlying causal mutations for color patterns in this dataset is hard for several reasons: precise phenotypic characterizations of coat color patterns in the Sheep HapMap breeds are not available; the 50K SNP array used does not offer sufficient density to associate a given selection signature with a specific set of polymorphisms; finally, from the literature, it appears that coat color is a complex trait, with high genetic heterogeneity. In particular, mutations in different genes can give rise to the same phenotype (e.g. in horses [29]). Also, within a gene different mutations can give rise to different phenotypes, e.g. mutations in the MC1R gene (also named the extension locus) have been associated with a large panel of skin or coat colors [27,30,31]. Deciphering selection signatures related to coat color in sheep and in particular identifying the causal variants under selection will require sequencing these genes for individuals from several breeds with diverging color patterns. This in turn will help to understand the evolutionary history of the breeds and the effect of selection [32]. To potentially help in this task, in Table S1 we list, for each "color gene", the populations that have likely been under selection.
Morphology. Another group of genes found within selection signatures have known effects on body morphology and development. NPR2, HMGA2 and BMP2, pointed out previously [4], are confirmed as good positional candidates by our study. We also found strong evidence for selection on WNT5A, ALX4 or EXT2, and two HOX gene clusters (HOXA and HOXC). WNT5A and ALX4 are two genes involved in the development of the limbs and skeleton. Mutations in WNT5A cause the dominant human Robinow syndrome, characterized by short stature, limb shortening, genital hypoplasia and craniofacial abnormalities [33]. ALX4 loss of function mutations cause polydactyly in the mouse, through dysregulation of the sonic hedgehog (SHH) signaling factor [34,35]. Moreover, the ALX4 protein has been shown to bind proteins from the HOXA (HOXA11 and HOXA3) and HOXC (HOXC4 and HOXC5) clusters [36]. Located just beside ALX4 and corresponding to the same selection signature, EXT2 is responsible for the development of exostoses in the mouse [37]. HOX genes are responsible for antero-posterior development and skeletal morphology along the anterior-posterior axis in vertebrates. The selection signature around HOXA is a recent selection signature in the SWA group, while that around HOXC is an ancestral signature with a high differentiation of the ASI ancestor compared to AFR and SWA (Table 3).

Finally, we note that an ancestral selection signature is found near the ACAN gene, whose expression was shown to be upregulated by BMP2 [38], another candidate gene for selection. Three genes within the selection signature are found closer to the maximum test value than ACAN, but these are in silico predicted genes, whose protein coding function has not been confirmed, so ACAN seems to be overall a better candidate for explaining selection in the region. Mutations in the ACAN gene have been shown to induce osteochondrosis [39] and skeletal dysplasia [40]. The ACAN region has also been shown to be associated with height in humans [41].

Traits of agronomic importance. Sheep have been raised for meat, milk and wool production. Under selection signatures, we found several genes associated with these production traits. In addition to the selection signature in Texels on the MSTN gene for increased muscularity [42], discussed in [13], we detected a selection signature centered on HDAC9 and including few other genes, which could also be linked to muscling. HDAC9 is a known transcriptional repressor of myogenesis. Its expression has been shown to be affected by the callipyge mutation in the sheep at the DLK1-DIO3 locus [43]. The signature around HDAC9 corresponds to a selection signature in the Garut breed from Indonesia, a breed used in ram fights. As already discussed, one selection signature contains ABCG2, a gene underlying a QTL with large effects on milk production (yield and composition) in cattle [20]. Also, one of the ancestral selection signatures reaches its maximum value close to the INSIG2 gene, recently shown to be associated with milk fatty acid composition in Holstein cattle [44]. Two selection signatures could be related to wool characteristics, one in the CEU group including the FGF5 gene, partly responsible for hair type in the domestic dog [45,46], and an ancestral selection signature on chromosome 25 in a QTL region associated with wool quality traits in the sheep [47,48].

One of the strong outlying regions in the selection scan contains the PITX3 gene. Further analysis revealed that this signature was due to the haplotype diversity of the German Texel population differing from that of the other Texel samples (results not shown). It turns out that the German Texel sample came from a case/control study for microphthalmia [7], although the case/control status information in this sample is not given in the Sheep HapMap dataset. The consequence of such a recruitment is to bias haplotype frequencies in the region associated with the disease, which produces a very strong differentiation signal between the German Texel and the other Texel populations. Although not related to artificial or natural selection in sheep, this signature illustrates that our method for detecting selection has the potential to identify causal variants in case/control studies, while using haplotype information.

Ancestral signatures of selection

It is difficult to estimate how far back in time signatures of selection found in the ancestral tree took place. In particular, it would be interesting to place the divergences shown by the ancestral population tree with respect to sheep domestication. Two interesting candidate genes for ancestral selection signatures might indicate that the selection signatures captured could be rather old. First, we found selection near the TRPM8 gene, which has been shown to be a major determinant of cold perception in the mouse [49]. The pattern of allele frequency at the significant SNP (Table 3) is consistent with the climate in the geographical origins of the population groups. AFR, ASI and ITA, living in warm climates, have a low frequency (0.04-0.16) of the A allele, while NEU and CEU, from colder regions, have higher frequencies (0.55-0.7), the SWE group having an intermediate frequency of 0.38. Overall, this selection signature might be due to an adaptation to cold climate through selection on a TRPM8 variant. Another selection signature lies close to a potential chicken domestication gene, TSHR [50], whose signaling regulates photoperiodic control of reproduction [51]. This selection signature was identified before [4] and our analysis indicates that selection happened before the divergence of breeds within geographic groups, consistent with an early selection event. Given its role, we can speculate that selection on the TSHR gene is related to seasonality of reproduction. Under temperate climates, sheep experience a reproductive cycle under photoperiodic control. Furthermore, there is evidence that this control was altered during domestication [52] so our analysis suggests genetic mutations in TSHR may have contributed to this alteration.

As discussed above, some of the genes found underlying ancestral selection signatures can be related to production or morphological traits (e.g. ASIP, INSIG2, ACAN, wool QTL), indicating that these traits have likely been important at the beginning of sheep history. The other genes that we could identify as likely selection targets in the ancestral population tree relate to immune response (GATA3) and in particular to antiviral response (TMEM154 [53], TRAF3 [54]). The most significant ancestral selection signature is centered around the NF1 gene, encoding neurofibromin. This gene is a negative regulator of the ras signal transduction pathway, and is therefore involved in cell proliferation and cancer, in particular neurofibromatosis. Due to this central role in intra-cellular signaling, mutations affecting this gene can have many phenotypic consequences so that its potential role in the adaptation of sheep breeds remains unclear.

Conclusions

The Sheep HapMap dataset is an exceptional resource for sheep genetics studies. In a population genomics context, our study shows that the rich information contained in these data makes it possible to start unraveling the genetic history of sheep populations worldwide. In order to fully exploit this information, we used recent statistical approaches that account for the relationship between populations and the linkage disequilibrium patterns (haplotype diversity). This allowed us to detect more selection signatures with confidence and to identify the selected populations for most of them. Among the new selection signatures detected by our study, several result from recent selection and include good positional candidate genes with functions related to pigmentation (KITLG, EDN3), morphology (WNT5A, ALX4, EXT2, HOXA cluster) or production traits (HDAC9). Two ancestral selection signatures are also of particular interest as they harbor genes (TRPM8 and TSHR) whose functions (cold and photoperiodic perception respectively) seem highly relevant to the selection response during the early history of domestic sheep.
With information on adaptive genome regions and selected populations, we hope that our work will foster new studies to unravel the underlying biological mechanisms involved. To this aim, it is likely that further phenotypic and genetic data are required. On the genetics side, even though the SNP array used in this study was sufficient to localize genome regions harboring adaptive mutations, its density and the SNP ascertainment bias resulting from its design did not allow the causative mutation to be tagged precisely. Elucidating the causal variation underlying selection signatures will thus most likely require large scale sequencing data.

Genome scans for selection, including this one, identify regions that are outliers from a statistical model and do not require specifying an alternative hypothesis based on phenotypic records. While this can be seen as an advantage for the initial localization of genome regions, it is a limitation for the identification of the biological processes involved. Gathering phenotypic records in specific populations, in particular for color and morphology traits, will be needed to go further.

Methods

Selecting populations and animals. Seventy-four breeds are represented in the Sheep HapMap data set, but we only used a subset of these breeds in our genome scan. We removed the breeds with small sample size (< 20 animals), for which haplotype diversity cannot be determined with sufficient precision. Based on historical information, we also removed all breeds resulting from a recent admixture or having experienced a severe recent bottleneck. Focusing on the remaining breeds, we then studied the genetic structure within each population group, in order to detect further admixture events. We performed a standardized PCA of individual based genotype data and applied the admixture software [16].

In two population groups (AFR and NEU) the different breeds were clearly separated into distinct clusters of the PCA and showed no evidence of recent admixture (Figures S1 and S2). These samples were left unchanged for the genome scan for selection. A similar pattern was observed in three other groups (ITA, SWA, ASI), except for a few outlier animals that had to be re-attributed to a different breed or simply removed (Figures S3, S4 and S5). In the two last groups (CEU and SWE), several admixed breeds were found and were consequently removed from the genome scan analysis (Figures S6 and S7). We performed a genome scan within each group of populations listed in Table 1, with a single SNP statistic FLK [10] and its haplotype version hapFLK [13].
Population trees. Both statistics require estimating the population tree, with a procedure described in detail in [10]. Briefly, we built a population tree for each group by first calculating Reynolds' distances between each population pair, and then applying the Neighbor Joining algorithm on the distance matrix. For each group, we rooted the tree using the Soay sheep as an outgroup. This breed has been isolated on an island for many generations and exhibits a very strong differentiation from all the breeds of the Sheep HapMap dataset, making it well suited to be used as an outgroup.
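
As an illustration of the first step of the tree-building procedure above, here is a minimal Python sketch of a Reynolds-type distance between pairs of populations computed from biallelic SNP allele frequencies. It assumes the simple form of the distance without sample-size correction; the exact estimator used in this study is the one described in [10] and may differ. The function names and the toy breed labels are hypothetical, and the Neighbor Joining step that would take the resulting matrix as input is not shown.

```python
def reynolds_distance(freqs_a, freqs_b):
    """Reynolds-type distance between two populations from biallelic SNP frequencies.

    Assumed form (no sample-size correction):
    D = sum_l sum_alleles (p_a - p_b)^2 / (2 * sum_l (1 - sum_alleles p_a * p_b)).
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations must be described at the same SNP")
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        qa, qb = 1.0 - pa, 1.0 - pb  # frequency of the other allele
        num += (pa - pb) ** 2 + (qa - qb) ** 2
        den += 1.0 - (pa * pb + qa * qb)
    if den == 0.0:
        raise ValueError("all SNP are fixed for the same allele in both populations")
    return num / (2.0 * den)


def distance_matrix(freqs_by_pop):
    """Return (names, matrix) of pairwise distances, with names sorted alphabetically."""
    names = sorted(freqs_by_pop)
    matrix = [
        [0.0 if a == b else reynolds_distance(freqs_by_pop[a], freqs_by_pop[b])
         for b in names]
        for a in names
    ]
    return names, matrix


# Toy allele frequencies at two SNP for three hypothetical populations.
toy = {
    "breed_A": [0.2, 0.5],
    "breed_B": [0.4, 0.5],
    "breed_C": [0.9, 0.1],
}
names, dist = distance_matrix(toy)
for name, row in zip(names, dist):
    print(name, [round(x, 4) for x in row])
```

A symmetric matrix of this kind is the usual input of a Neighbor Joining implementation, which would then be rooted with the outgroup population.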

FLK and hapFLK genome scans. The FLK statistic was computed for each SNP within each group.
The evolutionary model underlying the FLK statistic assumes that SNP were already polymorphic in
the ancestral population. To consider only loci that most likely match this hypothesis, we restricted our
analysis within each group to SNP for which the estimated ancestral minor allele frequency $p_0$ was above 5%.
Under neutrality, the FLK statistic should follow a $\chi^2$ distribution with $n-1$ degrees of freedom (DF),
where $n$ is the number of populations in the group. Overall, the fit of the theoretical distribution to the
observed distribution was very good (supporting information Text S1), with the mean of the observed
distribution ($\overline{\mathrm{FLK}}$) being very close to $n-1$ (Table S3). Using $\overline{\mathrm{FLK}}$ as DF
for the $\chi^2$ distribution provided a better fit to the observed data than the theoretical value $n-1$. We thus
computed FLK p-values using the $\chi^2(\overline{\mathrm{FLK}})$ distribution. To compute the hapFLK statistic, we
used the Scheet and Stephens LD model [55], a mixture model for haplotypes which requires specifying the number
of haplotype clusters to be used. To choose this number, for each group, we used the fastPHASE cross-validation
based estimation of the optimal number of clusters. The results of this estimation are given in Table S2. The
LD model was estimated on unphased genotype data. The hapFLK statistic was computed as an average
over 20 runs of the EM algorithm used to fit the LD model. As in [13], we found that the hapFLK distribution
could be modeled relatively well with a normal distribution (corresponding to non-outlying regions) and
a few outliers; we therefore used robust estimation of the mean and standard deviation of the hapFLK statistic
to eliminate the influence of outlying (i.e. potentially selected) regions. This procedure was done within
each group; the resulting mean and standard deviation values are given in Table S2. Finally,
we computed at each SNP a p-value for the null hypothesis from the normal distribution.
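The two p-value computations described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the median and the scaled median absolute deviation stand in for the unspecified "robust estimation" of the mean and standard deviation, and the hapFLK p-value is assumed to be one-sided (upper tail).

```python
import math
import statistics


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square distribution; df may be non-integer."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    if z < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma function.
        k = a
        term = total = 1.0 / a
        for _ in range(1000):
            k += 1.0
            term *= z / k
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return max(0.0, 1.0 - total * math.exp(log_prefactor))
    # Continued fraction (modified Lentz method) for the upper tail.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def flk_pvalues(flk_values):
    """FLK p-values from a chi-square whose DF is the observed mean FLK."""
    df = statistics.fmean(flk_values)
    return [chi2_sf(v, df) for v in flk_values]


def hapflk_pvalues(hapflk_values):
    """Upper-tail normal p-values after robust centering and scaling.

    Assumption: median and 1.4826 * MAD replace the robust estimators used
    in the paper, which are not specified in this section. The MAD must be
    non-zero for the scaling to be defined.
    """
    center = statistics.median(hapflk_values)
    mad = statistics.median([abs(v - center) for v in hapflk_values])
    scale = 1.4826 * mad
    return [0.5 * math.erfc((v - center) / (scale * math.sqrt(2.0)))
            for v in hapflk_values]
```

In practice both functions would be applied separately within each population group, since the degrees of freedom and the hapFLK mean and standard deviation are group specific.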

Selection in ancestral groups. The within-group FLK analysis provides for each SNP an estimate
of the allele frequency $p_0$ in the population ancestral to all populations of the group. We used this
information to test SNP for selection using between-group differentiation, with some adjustments. First,
the FLK model assumes that tested polymorphisms are present in the ancestral population. SNP for which the
alternate allele has been seen in only one population group are likely to have appeared after divergence
(within the ancestral tree) and were therefore removed from the analysis. Second, regions selected within
groups affect allele frequency in some breeds and therefore bias our estimation of the ancestral allele
frequency in this group. We therefore removed all SNP that were included in within-group selection
signatures. Finally, the FLK test requires a rooted population tree. For the within-group analysis, we
could use a population very distant from the current breeds (the Soay sheep). For the ancestral tree, we
created an outgroup homozygous for ancestral alleles at all SNP.
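The two SNP filters and the artificial outgroup can be illustrated with a short sketch. All names and data layouts below are hypothetical; in particular, frequencies are assumed to be those of the alternate (non-ancestral) allele, so an outgroup homozygous for ancestral alleles has frequency 0 at every SNP.

```python
def keep_for_ancestral_scan(alt_freq_by_group, position, selected_regions):
    """Return True if a SNP passes both filters of the ancestral analysis.

    alt_freq_by_group: dict mapping group name -> alternate allele frequency
    position:          (chromosome, base-pair position) of the SNP
    selected_regions:  list of (chromosome, start, end) within-group signatures
    """
    # Filter 1: the alternate allele must be seen in more than one group.
    groups_with_alt = sum(1 for f in alt_freq_by_group.values() if f > 0.0)
    if groups_with_alt <= 1:
        return False
    # Filter 2: drop SNP lying inside any within-group selection signature.
    chrom, pos = position
    for region_chrom, start, end in selected_regions:
        if chrom == region_chrom and start <= pos <= end:
            return False
    return True


def add_ancestral_outgroup(alt_freq_by_group, name="OUTGROUP"):
    """Add an artificial outgroup that is homozygous for the ancestral allele."""
    with_outgroup = dict(alt_freq_by_group)
    with_outgroup[name] = 0.0
    return with_outgroup
```

For example, a SNP whose alternate allele is observed only in the ASI group would be discarded by the first filter, whatever its position.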

Identifying selected regions and candidate genes. We defined significant regions for each statistic
and within each group of populations. Using the neutral distribution ($\chi^2$ for FLK and normal for
hapFLK), we computed the p-value of each statistic at each SNP. To identify selected regions, we
estimated the corresponding q-values [56] to control the FDR. For FLK, SNP with a q-value below 0.1 were considered
significant, which by definition implies that we expect 10% of false positives among our detected SNP.
Since the power of hapFLK is greater than that of FLK [13], we used a q-value threshold of 0.05 for hapFLK,
therefore controlling the FDR at the 5% level. For the FLK analysis in ancestral populations, we used an FDR
threshold of 5%.
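As an illustration of the thresholding step, the sketch below converts p-values into FDR-adjusted values and applies the thresholds quoted above. It is not the q-value estimator of [56]: it uses the Benjamini-Hochberg adjustment, which corresponds to the conservative special case where the proportion of true null hypotheses is taken to be 1. Function names are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def significant_snps(pvalues, threshold):
    """Indices of SNP whose adjusted value is below the FDR threshold."""
    return [i for i, q in enumerate(bh_qvalues(pvalues)) if q < threshold]


# Thresholds stated in the text: 0.1 for FLK, 0.05 for hapFLK
# and for the FLK analysis in ancestral populations.
FLK_THRESHOLD = 0.10
HAPFLK_THRESHOLD = 0.05
```

With the p-values `[0.01, 0.04, 0.03, 0.20]`, the adjusted values are approximately `[0.04, 0.0533, 0.0533, 0.20]`, so three SNP pass the 0.1 threshold and one passes the 0.05 threshold.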

We then aimed at identifying genes that seem good candidates for explaining selection signatures.
We proceeded differently for the single SNP FLK and hapFLK. For FLK, we considered that significant
SNP less than 500 kb apart were capturing the same selection signal. We then considered as potential
candidate genes any gene lying less than 1 Mb from any significant SNP. For hapFLK, the genome signal
is much more continuous than for single SNP tests, because the statistic captures multipoint LD with the
selected mutations. As a consequence, significant regions can span large chromosome intervals.
To restrict the list of potential candidate genes and target only those closest to the most significant
SNP, we restricted our search to the part of the signal where the difference in hapFLK value with the
most significant SNP was less than $0.5\sigma$. This takes into account the profile of the hapFLK
signal: if the profile resembles a plateau, the candidate region will be rather broad, while very sharp
hapFLK peaks will provide a narrower candidate region. We extracted all protein coding genes present in
the significant regions using the Ensembl Biomart tool (http://www.ensembl.org/biomart/) for the Ovis
aries 3.1 genome assembly. These full lists are provided as Supplementary data (Supporting Dataset 1
and Supporting Dataset 2). Within each candidate region, genes were ranked according to their distance
from the most significant position of the region (the larger the rank, the larger the distance). The
functional candidate genes shown in Table 2 and discussed in the manuscript were chosen based on this
rank and/or on their implication in previous association or sweep detection studies.
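The two region definitions can be sketched as follows. This is an illustrative reading of the rules above, with hypothetical function names: FLK hits on one chromosome are chained when consecutive significant SNP are less than 500 kb apart and each cluster is extended by 1 Mb on both sides, while the hapFLK candidate window is assumed to be the contiguous stretch around the peak where the statistic stays within $0.5\sigma$ of its maximum.

```python
def merge_flk_hits(positions, max_gap=500_000, flank=1_000_000):
    """Cluster significant SNP positions (bp, one chromosome) into regions."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] < max_gap:
            clusters[-1][1] = pos          # extend the current cluster
        else:
            clusters.append([pos, pos])    # start a new cluster
    # Genes within 1 Mb of any significant SNP fall in [start - flank, end + flank].
    return [(max(0, start - flank), end + flank) for start, end in clusters]


def hapflk_candidate_window(positions, values, sd, tolerance=0.5):
    """Window around the hapFLK peak where values stay within tolerance * sd.

    positions and values are assumed to be sorted along one chromosome and
    restricted to a single significant region.
    """
    peak = max(range(len(values)), key=values.__getitem__)
    threshold = values[peak] - tolerance * sd
    left = right = peak
    while left > 0 and values[left - 1] > threshold:
        left -= 1
    while right < len(values) - 1 and values[right + 1] > threshold:
        right += 1
    return positions[left], positions[right]
```

A plateau-shaped profile keeps many SNP above the threshold and yields a broad window, whereas a sharp peak yields a narrow one, which matches the behavior described in the text.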

Figure Legends



Figure 1. Localization of selection signatures identified in 7 groups of populations. Candidate genes 
are indicated above their genomic localization. Only chromosomes harboring selection signatures are 
plotted. 



Figure 2. Genome scan for selection signature in ancestral populations of the geographical groups. 
Significant SNP at the 5% FDR level are plotted in darker color. 

Tables

Table 1. Population groups from the Sheep HapMap dataset used for the detection of 
selection signatures 



Group            Abbreviation  Size  Populations (Abbreviations)
Africa           AFR                 Red Maasai (RMA); Ethiopian Menz (EMZ)
Asia             ASI                 Bangladeshi BGE (BGE); Bangladeshi Garole (BGA); Changthangi (CHA); Deccani (IDC); Garut (GUR); Indian Garole (GAR); Sumatra (SUM); Tibetan (TIB)
Central Europe   CEU                 Bundner Oberlander (BOS); Engadine Red (ERS); Valais Blacknose (VBS); Valais Red (VRS)
Italy            ITA                 Altamurana (ALT); Comisana (COM); Leccese (LEC); Sardinian Ancestral Black (SAB)
Northern Europe  NEU                 Galway (GAL); German (GTX), New Zealand (NTX) and Scottish (STX) Texel; Irish Suffolk (ISF); New Zealand Romney (NZR)
South West Asia  SWA                 Afshari (AFS); Moghani (MOG); Norduz (NDZ); Qezel (QEZ)
South West Europe SWE                Australian Merino (MER); Churra (CHU); Meat (LAM) and Milk (LAC) Lacaune

Table 2. Selection signatures in the 7 geographical groups. Regions identified with the hapFLK
or FLK test, with the corresponding population group and most differentiated populations (except for
the AFR group). Full names of groups and populations are given in Table 1. The number of genes
included in each region and the rank of candidate genes within the region are also provided. Overlapping
regions in different groups or with different tests are grouped by background color. †: signatures of
selection previously identified [4]. ‡: this outlying region is not due to evolutionary processes (see
details in the main text).



OAR  Begin (Mbp)  End (Mbp)  P-value  Q-value  Group  Test    Diff. pop.   Cand. gene  Nb. genes  Rank
2    46.65        57.99      6.3e-10  7.1e-07  ITA    hapFLK  COM          NPR2†       85         15
2    51.41        53.44      4.1e-09  1.6e-04  ITA    FLK     COM                      41         2
2    74.00        74.86      7.4e-04  3.7e-02  ITA    hapFLK  COM                      7
2    81.27        87.32      4.1e-09  2.3e-06  ITA    hapFLK  COM          BNC2        18         1
2    110.08       112.08     1.5e-05  6.7e-02  ASI    FLK     SUM TIB GUR              11
…    113.36       122.24     …        …        NEU    hapFLK  NTX STX      MSTN†       42         …
…    …            239.76     …        …        …      …       …            RUNX3       33         …
…    84.40        86.40      …        …        …      FLK     …            …           …          …
…    120.91       125.49     …        …        …      hapFLK  …            KITLG       …          …
…    122.07       130.85     …        …        …      hapFLK  …            …           26         6
…    …            …          …        1.3e-02  ITA    hapFLK  COM ALT SAB              27
4    4.61         6.61       5.3e-06  2.1e-02  SWA    FLK     MOG                      8
…    8.50         19.66      4.2e-06  1.1e-03  CEU    hapFLK  VBS VRS                  49
…    …            17.11      …        …        …      …       VBS          …           …          …
…    26.46        28.46      2.4e-05  9.1e-02  ASI    FLK     GUR IDC SUM  HDAC9       6          1
4    44.49        45.76      2.7e-04  3.4e-02  …      hapFLK  NZR                      12
…    …            …          …        …        …      FLK     BOS ERS                  14         1
10   28.50        30.50      3.2e-05  9.7e-02  SWA    FLK     NDZ                      14         1
10   28.50        30.50      1.3e-06  5.4e-02  SWE    FLK     MER                      14         1
10   48.90        49.59      5.2e-04  3.1e-02  CEU    hapFLK                           3
11   12.55        14.12      1.4e-04  2.2e-02  NEU    hapFLK                           33
11   24.18        38.74      9.8e-09  8.0e-05  SWE    hapFLK  LAC MER                  296
11   40.31        46.70      3.3e-06  5.5e-04  ITA    hapFLK  SAB                      164
12   42.66        44.66      3.4e-07  7.6e-03  ASI    FLK     SUM                      10
13   33.10        40.02      5.7e-06  1.8e-03  AFR    hapFLK                           41
13   40.60        50.30      4.9e-07  4.9e-04  AFR    hapFLK               BMP2†       76         1
13   43.34        51.28      2.7e-07  1.7e-04  SWE    hapFLK  LAC LAM      PRNP        49         8
13   56.11        57.17      2.5e-08  4.8e-04  SWA    hapFLK  MOG          EDN3        19         1
13   55.33        57.43      8.4e-11  1.1e-06  SWA    FLK     MOG                      19         1
14   6.37         13.60      1.6e-04  1.4e-02  ITA    hapFLK  SAB                      70
14   13.64        13.70      5.3e-04  4.9e-02  NEU    hapFLK  ISF          MC1R        48         33
14   13.70        16.46      1.2e-04  1.1e-02  ITA    hapFLK  SAB                      37         21
14   45.49        50.09      1.6e-04  2.5e-02  NEU    hapFLK  NTX NZR                  117
15   48.87        50.87      1.5e-05  6.7e-02  ASI    FLK     GAR IDC                  36
15   71.71        73.71      3.8e-06  1.6e-02  SWA    FLK     MOG          ALX4/EXT2   13         1/3
16   33.20        35.10      1.8e-04  1.8e-02  AFR    hapFLK               C6/C7       8          5/7
16   63.97        65.97      1.1e-05  6.7e-02  ASI    FLK     GAR IDC                  5
19   4.42         7.43       2.2e-04  1.9e-02  CEU    hapFLK  VRS BOS      GLB1†       17         14
19   30.42        35.09      3.2e-05  4.2e-03  CEU    hapFLK  VBS BOS ERS  MITF†       14         9
19   44.60        46.60      3.9e-06  3.9e-02  ASI    FLK     GAR BGA      WNT5A       4          1
20   36.74        38.52      2.8e-04  2.3e-02  CEU    hapFLK  VRS                      10
22   18.90        24.36      1.5e-11  7.4e-08  NEU    hapFLK  GTX          PITX3‡      85         5
23   42.50        46.96      2.2e-05  5.4e-03  AFR    hapFLK               MC2R/MC5R   35         1/2
23   54.14        56.14      3.8e-07  7.6e-03  ASI    FLK     GAR                      5
25   0.08         3.08       3.7e-04  2.4e-02  ITA    hapFLK  SAB                      16

Table 3. Selection signatures in ancestral populations. SNP with significant FLK value at the
5% FDR level, with estimated allele frequencies in all ancestral groups. The number of genes included
in each region (1 Mb upstream or downstream of the position) and the rank of candidate genes within the
region are also provided. †: signatures of selection previously identified [4].



                Estimated ancestral allele frequencies
OAR  Position   AFR   ASI   SWA   NEU   CEU   ITA   SWE   P-value  Q-value  Cand. gene  Nb. genes  Rank
1    7192190    0.15  0.08  0.16  0.55  0.69  0.04  0.38  1.7e-06  5.3e-03  TRPM8       19         8
1    237070498  0.87  0.95  0.91  0.48  0.24  0.77  0.35  1.4e-05  2.5e-02  GYG1        16         5
1    239424807  0.46  0.68  0.06  0.21  0.15  0.11  0.17  3.4e-05  4.8e-02              9
1    239491620  0.53  0.41  0.94  0.86  0.93  0.93  0.88  4.3e-05  5.6e-02              9
2    45500785   0.43  0.91  0.23  0.76  0.87  0.87  0.93  2.2e-06  6.4e-03  LPL         6          3
2    182607165  0.99  0.97  0.18  0.64  0.73  0.83  0.64  3.4e-08  1.8e-04  INSIG2      10         3
2    182672296  0.99  0.94  0.32  0.90  0.86  0.89  0.81  7.7e-07  2.8e-03              10
2    192231314  0.59  0.93  0.36  0.96  0.89  0.81  0.95  1.6e-05  2.8e-02              8
3    132478420  0.24  0.89  0.18  0.93  0.81  0.84  0.82  1.2e-06  3.9e-03  HOXC†       54         1→9
3    180860403  0.71  0.53  0.28  0.82  0.31  0.12  0.13  1.7e-05  2.8e-02              22
5    15522700   0.68  0.63  0.92  0.27  0.76  0.99  0.78  9.8e-06  2.0e-02              51
7    89519883   0.63  0.61  0.19  0.89  0.18  0.60  0.95  6.1e-10  5.2e-06  TSHR†       9          3
…    31748642   0.84  0.93  0.94  0.16  0.63  0.47  0.19  2.8e-05  4.1e-02  PREP†       9          1
11   18248852   0.35  0.32  0.82  0.64  0.94  0.96  0.92  1.3e-05  2.5e-02  NF1†        23         1
11   18325488   0.87  0.93  0.00  0.35  0.04  0.03  0.04  3.3e-16  7.2e-12              24         4
11   18335747   0.87  0.93  0.00  0.35  0.04  0.03  0.04  3.3e-16  7.2e-12              22         4
11   18433474   0.87  0.93  0.02  0.35  0.07  0.02  0.05  3.8e-15  5.4e-11              22         1
11   18440783   0.78  0.93  0.02  0.34  0.07  0.02  0.05  2.0e-14  2.2e-10              22         1
11   25704651   0.97  0.96  0.97  0.42  0.94  0.94  0.96  8.5e-06  1.9e-02              73
11   26284826   0.99  0.97  0.94  0.38  0.93  0.95  0.79  3.2e-05  4.6e-02              100
11   26571629   0.92  0.94  0.98  0.29  0.89  0.88  0.86  1.8e-05  2.8e-02              115
11   26872280   0.78  0.71  0.93  0.15  0.89  0.90  0.90  2.2e-07  9.5e-04              111
13   12120674   0.29  0.84  0.97  0.91  0.97  0.92  0.84  7.7e-06  1.8e-02  GATA3       6          1
13   62857560   0.52  0.62  0.65  0.98  0.67  0.92  0.36  3.6e-06  9.7e-03  ASIP†       32         12
15   3706790    0.71  0.22  0.96  0.28  0.27  0.34  0.21  6.8e-06  1.7e-02              4
15   29856310   0.98  0.99  0.99  0.47  0.92  0.95  0.96  9.8e-06  2.0e-02              35
16   38696505   0.95  0.98  0.95  0.99  0.68  0.31  0.30  6.8e-07  2.7e-03  PRLR†       18         2
17   4867509    0.91  0.95  0.85  0.54  0.18  0.58  0.17  1.8e-05  2.8e-02  TMEM154     9          1
18   19342316   0.90  0.79  0.67  0.35  0.75  0.10  0.09  1.9e-07  9.3e-04  ACAN†       31         4
18   66470371   0.99  0.97  0.90  0.90  0.18  0.04  0.08  1.9e-09  1.3e-05  TRAF3       28         5
20   17381047   0.24  0.61  0.97  0.98  0.93  0.99  0.91  3.1e-08  1.8e-04  VEGFA†      48         1